Re-enable dollar ($) line anchor in regular expressions in find mode #5289

Merged
merged 15 commits into from
Apr 26, 2022


NVnavkumar
Collaborator

@NVnavkumar NVnavkumar commented Apr 21, 2022

Fixes #4533.

This re-enables support for the end-of-line anchor ($) in regular expressions. A few caveats:

  1. This only enables support in RegexFindMode. A separate issue will need to be filed for Replace and Split modes (if necessary).

  2. Apache Spark only uses regular expressions in standard mode (not multiline), so we only support $ as defined in that mode. The behavior will not match that of $ in multiline mode.

  3. This code handles line terminators around the line anchor. The $ anchor matches differently when the regular expression also contains line terminator characters. In particular:

  • When $ follows characters other than a line terminator sequence, the transpiled expression needs to optionally match a line terminator sequence before the end of the string (line terminators are defined here in the Line terminators section).
  • When $ is used with line terminator characters, the regular expression can be forced to match a specific line terminator. There are 4 cases to call out here:
    • \r$ - this matches only a CR before the end of the string, so there is no need to transpile in this case: matching the end of the string using $ in cuDF is equivalent
    • [any other line terminator character including \n]$ - this matches that line terminator character plus, optionally, any other valid line terminator character before the end of the string $
    • $\n - this matches only an LF (newline) before the end of the string, and no other line terminator sequence. This requires the underlying cuDF to support negative lookahead groups, so this case falls back to the CPU. (Because \r\n is a valid line terminator sequence, the pattern must explicitly reject that sequence, which can only be handled by a negative lookahead group. See cudf#3100 on lookaheads.)
    • [any other line character including \r]$ - this matches that line terminator character plus, optionally, any other valid line terminator character before the end of the string $
  • Multiple $ in a row (such as $$) are handled by Java by reducing them to a single $
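
To make the cases above concrete, here is a minimal Java sketch of how `java.util.regex` treats `$` in standard (non-multiline) mode, plus one possible rewrite of a trailing `$` for an engine whose `$` only matches the very end of the string. The names `DollarAnchorDemo`, `find`, and `rewriteTrailingDollar` are made up for this illustration and are not part of the plugin. Using `\z` as a stand-in for an end-of-string-only anchor is an assumption, and the rewritten pattern is not necessarily what the transpiler in this PR emits.

```java
import java.util.regex.Pattern;

// Illustration only: shows java.util.regex semantics for `$` with default flags.
final class DollarAnchorDemo {

    // True if `regex` is found anywhere in `input` (default flags, not MULTILINE).
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Hypothetical rewrite of "<prefix>$": optionally consume one line terminator
    // sequence, then require the true end of input. `\z` stands in for an
    // end-of-string-only anchor (an assumption about the target engine).
    static String rewriteTrailingDollar(String prefix) {
        return prefix + "(?:\\r\\n|[\\n\\r\\u0085\\u2028\\u2029])?\\z";
    }

    public static void main(String[] args) {
        // Standard mode: `$` matches at the end of input or before one final terminator.
        System.out.println(find("a$", "a\r\n"));   // true
        System.out.println(find("a$", "a\n\n"));   // false: two terminators
        System.out.println(find("a$", "a\nb"));    // false: not multiline
        System.out.println(
            Pattern.compile("a$", Pattern.MULTILINE).matcher("a\nb").find()); // true

        // \r$ only matches a CR at the very end, never the CR of a CRLF pair.
        System.out.println(find("\\r$", "a\r"));   // true
        System.out.println(find("\\r$", "a\r\n")); // false

        // $\n only matches a lone LF at the end, never the LF of a CRLF pair.
        System.out.println(find("$\\n", "a\n"));   // true
        System.out.println(find("$\\n", "a\r\n")); // false

        // Repeated anchors behave like a single one.
        System.out.println(find("a$$", "a\n"));    // true

        // The rewritten pattern agrees with `a$` on these sample inputs.
        String[] samples = {"a", "a\n", "a\r\n", "a\r", "a\u0085", "a\u2028",
                            "a\u2029", "a\n\n", "a\nb", "a\n\r", "ab"};
        String rewritten = rewriteTrailingDollar("a");
        for (String s : samples) {
            System.out.println(find("a$", s) == find(rewritten, s)); // true each time
        }
    }
}
```

The rewrite only preserves whether a match is found. The matched span differs, because the rewritten pattern consumes the optional line terminator while `$` itself is zero-width.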

Signed-off-by: Navin Kumar <navink@nvidia.com>
…r line termination characters

… support to transpile. Also, add more comments

@NVnavkumar NVnavkumar self-assigned this Apr 21, 2022
@sameerz sameerz added the feature request New feature or request label Apr 21, 2022
@sameerz sameerz added this to the Apr 18 - Apr 29 milestone Apr 21, 2022
@andygrove
Contributor

This is looking good. Should we also include form-feed \f in the tests?
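
For context on the form-feed question, a small Java sketch (not taken from the PR's test suite) of how `$` in standard mode treats `\f`: form feed counts as whitespace for `\s`, but it is not one of the line terminators that `$` may skip before the end of the string. The class name `FormFeedAnchorDemo` and its `find` helper are hypothetical.

```java
import java.util.regex.Pattern;

// Illustration only: form feed is whitespace, but not a line terminator for `$`.
final class FormFeedAnchorDemo {

    // True if `regex` is found anywhere in `input` (default flags, not MULTILINE).
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(find("a$", "a\n"));        // true: LF is a line terminator
        System.out.println(find("a$", "a\f"));        // false: FF is not
        System.out.println(find("a\\s$", "a\f"));     // true: \s matches FF
        System.out.println(find("a\\f$", "a\f\n"));   // true: FF, then a final LF
    }
}
```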

…inator to test whitespace around a line anchor

@NVnavkumar
Collaborator Author

build

@NVnavkumar NVnavkumar marked this pull request as ready for review April 25, 2022 18:36
@NVnavkumar
Collaborator Author

build

Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Re-enable support for $ in regular expressions
3 participants